16 research outputs found

    In-Datacenter Performance Analysis of a Tensor Processing Unit

    Full text link
    Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.Comment: 17 pages, 11 figures, 8 tables. To appear at the 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 24-28, 201

    Flexible Cache Error Protection using an ECC FIFO

    No full text
    We present ECC FIFO, a mechanism enabling two-tiered last-level cache error protection using an arbitrarily strong tier-2 code without increasing on-chip storage. Instead of adding redundant ECC information to each cache line, our ECC FIFO mechanism off-loads the extra information to off-chip DRAM. We augment each cache line with a tier-1 code, which provides error detection consuming limited resources. The redundancy required for strong protection is provided by a tier-2 code placed in off-chip memory. Because errors that require tier-2 correction are rare, the overhead of accessing DRAM is unimportant. We show how this method can save 15 − 25 % and 10 − 17 % of on-chip cache area and power respectively while minimally impacting performance, which decreases by 1 % on average across a range of scientific and consumer benchmarks

    Dynamic Power Supply Current Testing for Open Defects in CMOS SRAMs

    No full text
    The detection of open defects in CMOS SRAM has been a time consuming process. This paper proposes a new dynamic power supply current testing method to detect open defects in CMOS SRAM cells. By monitoring a dynamic current pulse during a transition write operation or a read operation, open defects can be detected. In order to measure the dynamic power supply current pulse, a current monitoring circuit with low hardware overhead is developed. Using the sensor, the new testing method does not require any additional test sequence. The results show that the new test method is very efficient compared with other testing methods. Therefore, the new testing method is very attractive

    Containment Domains: A Scalable, Efficient and Flexible Resilience Scheme for Exascale Systems

    No full text
    This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enable applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical analysis and show that containment domains are superior to both checkpoint restart and redundant execution approaches
    corecore